Abstract

MultiDendrograms is a Java-written application that computes agglomerative 
hierarchical clusterings of data. Starting from a distances (or weights) ma- 
trix, MultiDendrograms is able to calculate its dendrograms using the most 
common agglomerative hierarchical clustering methods. The application im- 
plements a variable-group algorithm that solves the non-uniqueness problem 
found in the standard pair-group algorithm. This problem arises when two 
or more minimum distances between different clusters are equal during the 
agglomerative process, because then different output clusterings are possible 
depending on the criterion used to break ties between distances.
MultiDendrograms solves this problem by grouping more than two clusters at
the same time when ties occur.

Keywords: Hierarchical classification, Agglomerative algorithms, Ties in 
proximity, Dendrogram 



1. Introduction 



Agglomerative hierarchical clustering (Cormack, 1971; Gordon, 1999; Sneath
and Sokal, 1973) starts from a proximity matrix between individuals, each
one forming a singleton cluster, and gathers clusters into groups of clusters
or superclusters, the process being repeated until a complete hierarchy of
partitions into clusters is formed. Among the different types of agglomera- 
tive methods we find Single Linkage, Complete Linkage, Unweighted Aver- 
age, Weighted Average, etc., which differ in the definition of the proximity 
measure between clusters. Except for the Single Linkage case, all the other 
agglomerative hierarchical clustering techniques suffer from a non-uniqueness 
problem, sometimes called the ties in proximity problem, when the standard 
pair-group algorithm is used. This problem arises when two or more mini- 
mum distances between different clusters are equal during the agglomerative 
process. The standard approach consists of choosing any one of the tied
pairs of clusters, thereby breaking the tie, and proceeding in the same way
until a final hierarchical classification is obtained. However, different
output clusterings are possible depending on the criterion used to break ties.

Within the family of agglomerative hierarchical methods, Single Linkage
and Complete Linkage are more prone than other methods to encounter ties
during the clustering process, since they do not produce new proximity
values different from the initial ones. As for ties in the original data,
they are more frequent when working with binary variables, or with integer
variables that take only a few distinct values. However, they can also
appear with continuous variables, especially if the precision of the
experimental data is low. Conversely, the absence of ties is sometimes an
artifact of representing the data with more decimal digits than their
precision warrants. The non-uniqueness problem also depends on the measure
used to obtain the proximity values from the initial variables. Moreover,
in general, the larger the data set, the more ties arise (MacCuish et al.,
2001).



The ties in proximity problem is well known from several studies in
different fields, for example in biology (Arnau et al., 2005; Backeljau et
al., 1996; Hart, 1983), in psychology (van der Kloot et al., 2005), or in
chemistry (MacCuish et al., 2001). Nevertheless, this problem is frequently
ignored by software packages (Backeljau et al., 1996; Morgan and Ray, 1995;
van der Kloot et al., 2005). Examples of packages which do not mention the
problem are: the linkage function in the Statistics Toolbox of MATLAB (The
MathWorks Inc., 2012); the hclust function in the stats package and the
agnes function in the cluster package of R (R Development Core Team, 2011);
and the cluster and clustermat commands of Stata (StataCorp LP, 2011).

In contrast, some other statistical packages warn about the existence
of the non-uniqueness problem in agglomerative hierarchical clustering.
For instance, if there are ties, then the CLUSTER procedure of SAS (SAS
Institute Inc., 2011) reports the presence of ties in the SAS log and in a
column of the cluster history. The results of a cluster analysis performed with SAS
depend on the order of the observations in the data set, since ties are broken 
as follows: "Each cluster is identified by the smallest observation number 
among its members. For each pair of clusters, there is a smaller identifi- 
cation number and a larger identification number. If two or more pairs of 
clusters are tied for minimum distance between clusters, the pair that has 
the minimum larger identification number is merged. If there is a tie for min- 
imum larger identification number, the pair that has the minimum smaller 
identification number is merged." (SAS Institute Inc., 2011). 

Another example can be found in the Hierarchical Clustering Analysis
procedure of SPSS Statistics (IBM Corp., 2011), where it is explicitly
stated that the results of the hierarchical clustering depend on the order of 
cases in the input file: "If tied distances or similarities exist in the input data 
or occur among updated clusters during joining, the resulting cluster solution 
may depend on the order of cases in the file. You may want to obtain several 
different solutions with cases sorted in different random orders to verify the 
stability of a given solution." (IBM Corp., 2011).

Finally, a third example of warning comes from the Agglomerate function
in the Hierarchical Clustering Package of Mathematica (Wolfram Research
Inc., 2008), where the user is briefly warned about the presence of ties by the
following message: "Ties have been detected; reordering input may produce 
a different result." (Wolfram Research Inc., 2008). 

Therefore, software packages which do not ignore the non-uniqueness
problem fail to adopt a common standard with respect to ties, and they
simply break ties in an arbitrary way. Here we introduce MultiDendrograms,
an application that implements the variable-group algorithm (Fernandez and
Gomez, 2008) to solve the non-uniqueness problem found in the standard
pair-group approach. In Section 2 we describe the variable-group algorithm,
which groups more than two clusters at the same time when ties occur.
Section 3 contains a basic manual of MultiDendrograms. In Section 4 we show
a case study performed with MultiDendrograms using data from a real
example. Finally, in Section 5 we give some concluding remarks.

2. Agglomerative hierarchical algorithms

2.1. Pair-group algorithm

Agglomerative hierarchical procedures build a hierarchical classification
in a bottom-up way, from a proximity matrix containing dissimilarity data
between individuals of a set $\Omega = \{x_1, \ldots, x_n\}$. (Note that the
same analysis could be done using similarity data.) The algorithm has the
following steps:

0) Initialize $n$ singleton clusters with one individual in each of them:
   $\{x_1\}, \ldots, \{x_n\}$. Initialize also the distances between clusters,
   $D(\{x_i\}, \{x_j\})$, with the values of the distances between
   individuals, $d(x_i, x_j)$:

   $$D(\{x_i\}, \{x_j\}) = d(x_i, x_j), \quad \forall i, j = 1, \ldots, n.$$

1) Find the shortest distance separating two different clusters.

2) Select two clusters $X_i$ and $X_{i'}$, subsets of $\Omega$, separated by
   such shortest distance and merge them into a new supercluster
   $X_i \cup X_{i'}$.

3) Compute the distances $D(X_i \cup X_{i'}, X_j)$ between the new
   supercluster $X_i \cup X_{i'}$ and each of the other clusters $X_j$.

4) If all individuals are not in the same cluster yet, then go back to step 1.



Following Sneath and Sokal (1973), this type of approach is known as a
pair-group method, as opposed to the variable-group method which we will
introduce in subsection 2.2. Depending on the criterion used for the
calculation of distances in step 3, different agglomerative hierarchical
clusterings can be implemented. The most commonly used are: Single Linkage,
Complete Linkage, Unweighted Average, Weighted Average, Unweighted
Centroid, Weighted Centroid, and Ward's method.



Lance and Williams (1966) put these different hierarchical strategies into
a single system, avoiding the need for a separate computer program for each
one of them. Assume three clusters $X_i$, $X_{i'}$ and $X_j$, containing
$|X_i|$, $|X_{i'}|$ and $|X_j|$ individuals respectively, and with distances
between them already determined as $D(X_i, X_{i'})$, $D(X_i, X_j)$ and
$D(X_{i'}, X_j)$. Further assume that the smallest of all distances still to
be considered is $D(X_i, X_{i'})$, so that $X_i$ and $X_{i'}$ are joined to
form a new supercluster $X_i \cup X_{i'}$, with $|X_i| + |X_{i'}|$
individuals. Lance and Williams (1966) analyzed the distance
$D(X_i \cup X_{i'}, X_j)$ that appears in step 3 of the above algorithm, and
they expressed it in terms of the distances already defined, all known at
the moment of fusion, using the following recurrence relation:

$$D(X_i \cup X_{i'}, X_j) = \alpha_i D(X_i, X_j) + \alpha_{i'} D(X_{i'}, X_j)
  + \beta D(X_i, X_{i'}) + \gamma \, |D(X_i, X_j) - D(X_{i'}, X_j)| . \quad (1)$$

Using this technique, superclusters can always be computed from previous
clusters and it is not necessary to look back at the original data matrix
during the agglomerative process. The values of the parameters $\alpha_i$,
$\alpha_{i'}$, $\beta$ and $\gamma$ determine the nature of the clustering
strategy. Table 1 gives the values of the parameters that define the most
commonly used agglomerative hierarchical clustering methods.

$$
\begin{array}{lcccc}
\text{Method} & \alpha_i & \alpha_{i'} & \beta & \gamma \\
\hline
\text{Single Linkage} & \frac{1}{2} & \frac{1}{2} & 0 & -\frac{1}{2} \\
\text{Complete Linkage} & \frac{1}{2} & \frac{1}{2} & 0 & +\frac{1}{2} \\
\text{Unweighted Average} & \frac{|X_i|}{|X_i|+|X_{i'}|} & \frac{|X_{i'}|}{|X_i|+|X_{i'}|} & 0 & 0 \\
\text{Weighted Average} & \frac{1}{2} & \frac{1}{2} & 0 & 0 \\
\text{Unweighted Centroid} & \frac{|X_i|}{|X_i|+|X_{i'}|} & \frac{|X_{i'}|}{|X_i|+|X_{i'}|} & -\frac{|X_i||X_{i'}|}{(|X_i|+|X_{i'}|)^2} & 0 \\
\text{Weighted Centroid} & \frac{1}{2} & \frac{1}{2} & -\frac{1}{4} & 0 \\
\text{Ward} & \frac{|X_i|+|X_j|}{|X_i|+|X_{i'}|+|X_j|} & \frac{|X_{i'}|+|X_j|}{|X_i|+|X_{i'}|+|X_j|} & -\frac{|X_j|}{|X_i|+|X_{i'}|+|X_j|} & 0
\end{array}
$$

Table 1: Parameter values for Lance and Williams' formula.
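
As an illustration of how a single routine can serve all seven methods, the following Python sketch evaluates the Lance and Williams recurrence for one merge. It is not taken from MultiDendrograms: the function name `lance_williams`, its argument names and the method keys are hypothetical, and the parameter values are the standard ones for each method.

```python
def lance_williams(d_ij, d_i2j, d_ii2, n_i, n_i2, n_j, method):
    """Distance from the merged cluster X_i U X_i' to X_j (recurrence of Eq. 1).

    d_ij  = D(X_i, X_j), d_i2j = D(X_i', X_j), d_ii2 = D(X_i, X_i');
    n_i, n_i2, n_j are the sizes of X_i, X_i' and X_j.
    """
    n = n_i + n_i2  # size of the new supercluster
    params = {
        # method: (alpha_i, alpha_i', beta, gamma)
        "single": (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0, 0.5),
        "unweighted_average": (n_i / n, n_i2 / n, 0.0, 0.0),
        "weighted_average": (0.5, 0.5, 0.0, 0.0),
        "unweighted_centroid": (n_i / n, n_i2 / n, -n_i * n_i2 / n**2, 0.0),
        "weighted_centroid": (0.5, 0.5, -0.25, 0.0),
        "ward": ((n_i + n_j) / (n + n_j), (n_i2 + n_j) / (n + n_j),
                 -n_j / (n + n_j), 0.0),
    }
    a_i, a_i2, beta, gamma = params[method]
    return a_i * d_ij + a_i2 * d_i2j + beta * d_ii2 + gamma * abs(d_ij - d_i2j)
```

With `d_ij = 3.0` and `d_i2j = 5.0`, the `"single"` entry returns the smaller of the two distances and the `"complete"` entry the larger one, which is the expected behaviour of those two methods.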

2.2. Variable-group algorithm

The algorithm proposed by Fernandez and Gomez (2008) to ensure uniqueness
in agglomerative hierarchical clustering has the following steps:

0) Initialize $n$ singleton clusters with one individual in each of them:
   $\{x_1\}, \ldots, \{x_n\}$. Initialize also the distances between clusters,
   $D(\{x_i\}, \{x_j\})$, with the values of the distances between
   individuals, $d(x_i, x_j)$:

   $$D(\{x_i\}, \{x_j\}) = d(x_i, x_j), \quad \forall i, j = 1, \ldots, n.$$

1) Find the shortest distance separating two different clusters, and record
   it as $D_{\text{lower}}$.

2) Select all the groups of clusters separated by shortest distance
   $D_{\text{lower}}$ and merge them into several new superclusters $X_I$.
   The result of this step can be some superclusters made up of just one
   single cluster ($|I| = 1$), as well as some superclusters made up of
   various clusters ($|I| > 1$). Notice that the latter superclusters all
   must satisfy the condition $D_{\min}(X_I) = D_{\text{lower}}$, where

   $$D_{\min}(X_I) = \min_{i \in I} \min_{i' \in I,\, i' \neq i} D(X_i, X_{i'}) .$$

3) Update the distances between clusters following the next substeps:

   3.1) Compute the distances $D(X_I, X_J)$ between all superclusters, and
        record the minimum of them as $D_{\text{next}}$ (this will be the
        shortest distance $D_{\text{lower}}$ in the next iteration of the
        algorithm).

   3.2) For each supercluster $X_I$ made up of various clusters ($|I| > 1$),
        assign a common agglomeration interval
        $[D_{\text{lower}}, D_{\max}(X_I)]$ for all its constituent clusters
        $X_i$, $i \in I$, where

        $$D_{\max}(X_I) = \max_{i \in I} \max_{i' \in I,\, i' \neq i} D(X_i, X_{i'}) .$$

4) If all individuals are not in the same cluster yet, then go back to step 1.
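
Steps 1, 2 and 3.2 can be sketched in Python as follows. This is only an illustrative reading of the steps, not the MultiDendrograms source code: the function `group_ties`, the dictionary-of-pairs representation and the `precision` argument (decimal places used when comparing distances) are assumptions made for the example. Clusters tied at the shortest distance are joined with a small union-find structure, so that chains of ties such as (A, B) and (B, C) end up in the same supercluster.

```python
from itertools import combinations


def group_ties(labels, dist, precision=6):
    """One pass of steps 1, 2 and 3.2: returns (d_lower, superclusters).

    `dist` maps frozenset({a, b}) to the distance between clusters a and b.
    Each supercluster is (members, d_max); d_max is None when |I| = 1.
    """
    rounded = {pair: round(d, precision) for pair, d in dist.items()}
    d_lower = min(rounded.values())  # step 1: shortest distance

    # Step 2: union-find over clusters, joining every pair tied at d_lower.
    parent = {x: x for x in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, d in rounded.items():
        if d == d_lower:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for x in labels:
        groups.setdefault(find(x), []).append(x)

    # Step 3.2: agglomeration interval [d_lower, d_max] for each real merge.
    superclusters = []
    for members in groups.values():
        if len(members) == 1:
            superclusters.append((members, None))
        else:
            d_max = max(rounded[frozenset(p)] for p in combinations(members, 2))
            superclusters.append((members, d_max))
    return d_lower, superclusters
```

For four clusters A, B, C and D with D(A, B) = D(B, C) = 0.4, D(A, C) = 0.5 and all distances to D larger than 0.4, the sketch groups A, B and C into one supercluster with interval [0.4, 0.5] and leaves D on its own. Computing the distances between the resulting superclusters (substep 3.1) is left out of the sketch.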

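In the same way that several agglomerative hierarchical methods can be
computed with the same pair-group algorithm using Lance and Williams'
formula, Fernandez and Gomez (2008) gave a generalization of Eq. (1),
compatible with the agglomeration of more than two clusters simultaneously,
that can be used to compute the distances $D(X_I, X_J)$ in step 3 of the
variable-group algorithm. Suppose we want to agglomerate two superclusters
$X_I$ and $X_J$, respectively indexed by $I = \{i_1, \ldots, i_p\}$ and
$J = \{j_1, j_2, \ldots, j_q\}$. Then the distance between them is defined as:

$$
\begin{aligned}
D(X_I, X_J) ={} & \sum_{i \in I} \sum_{j \in J} \alpha_{ij} D(X_i, X_j) \\
 & + \sum_{i \in I} \sum_{\substack{i' \in I \\ i' > i}} \beta_{ii'} D(X_i, X_{i'})
   + \sum_{j \in J} \sum_{\substack{j' \in J \\ j' > j}} \beta_{jj'} D(X_j, X_{j'}) \\
 & + \delta \sum_{i \in I} \sum_{j \in J} \gamma_{ij} \left[ D_{\max}(X_I, X_J) - D(X_i, X_j) \right] \\
 & - (1 - \delta) \sum_{i \in I} \sum_{j \in J} \gamma_{ij} \left[ D(X_i, X_j) - D_{\min}(X_I, X_J) \right] , \qquad (2)
\end{aligned}
$$

where

$$D_{\max}(X_I, X_J) = \max_{i \in I} \max_{j \in J} D(X_i, X_j)$$

and

$$D_{\min}(X_I, X_J) = \min_{i \in I} \min_{j \in J} D(X_i, X_j) .$$

Table 2 shows the values obtained by Fernandez and Gomez (2008) for the
parameters $\alpha_{ij}$, $\beta_{ii'}$, $\beta_{jj'}$, $\gamma_{ij}$ and
$\delta$ which determine the clustering method calculated with Eq. (2).

$$
\begin{array}{lccccc}
\text{Method} & \alpha_{ij} & \beta_{ii'} & \beta_{jj'} & \gamma_{ij} & \delta \\
\hline
\text{Single Linkage} & \frac{1}{|I||J|} & 0 & 0 & \frac{1}{|I||J|} & 0 \\
\text{Complete Linkage} & \frac{1}{|I||J|} & 0 & 0 & \frac{1}{|I||J|} & 1 \\
\text{Unweighted Average} & \frac{|X_i||X_j|}{|X_I||X_J|} & 0 & 0 & 0 & - \\
\text{Weighted Average} & \frac{1}{|I||J|} & 0 & 0 & 0 & - \\
\text{Unweighted Centroid} & \frac{|X_i||X_j|}{|X_I||X_J|} & -\frac{|X_i||X_{i'}|}{|X_I|^2} & -\frac{|X_j||X_{j'}|}{|X_J|^2} & 0 & - \\
\text{Weighted Centroid} & \frac{1}{|I||J|} & -\frac{1}{|I|^2} & -\frac{1}{|J|^2} & 0 & - \\
\text{Ward} & \frac{|X_i|+|X_j|}{|X_I|+|X_J|} & -\frac{|X_J|}{|X_I|} \frac{|X_i|+|X_{i'}|}{|X_I|+|X_J|} & -\frac{|X_I|}{|X_J|} \frac{|X_j|+|X_{j'}|}{|X_I|+|X_J|} & 0 & -
\end{array}
$$

Table 2: Parameter values for Fernandez and Gomez's formula.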

Using the pair-group algorithm, only the Centroid methods (Weighted
and Unweighted) may produce reversals. Recall that a reversal arises in a
dendrogram when it contains at least two clusters $X$ and $Y$ for which
$X \subset Y$ but $h(X) > h(Y)$, where $h(X)$ is the height in the dendrogram
at which the individuals of cluster $X$ are merged together (Morgan and Ray,
1995). In the case of the variable-group algorithm, reversals may appear
in substep 3.2 when $D_{\max}(X_I) > D_{\text{next}}$ for any supercluster
$X_I$. Although reversals make dendrograms difficult to interpret if they
occur during the last stages of the agglomerative process, it can be argued
that they are not very disturbing if they occur during the first stages.
Thus, as happens with the Centroid methods in the pair-group case, it could
be reasonable to use the variable-group algorithm as long as no reversals at
all, or only unimportant ones, are obtained.

In substep 3.2 of the variable-group clustering algorithm, sometimes it will
not be enough to adopt a fusion interval, and it will be necessary to obtain an
exact fusion value (e.g. in order to calculate a distortion measure). In these
cases, MultiDendrograms systematically uses the shortest distance,
$D_{\text{lower}}$, as the fusion value for superclusters made up of several
clusters. This criterion recovers the pair-group result for the Single
Linkage method and, in addition, avoids the appearance of reversals.
However, it must be emphasized that using exact fusion values, without
considering the full length of the fusion intervals, means that some
valuable information regarding the heterogeneity of the clusters is lost.

3. MultiDendrograms manual 

This section contains a basic manual for version 3.0 of the application
MultiDendrograms. To get the latest available version of the software, please
visit the MultiDendrograms web page at

http://deim.urv.cat/~sgomez/multidendrograms.php

3.1. Input data

MultiDendrograms reads its input data from a plain text file that
represents a distances (or weights) matrix. Two arrangements of the data
are accepted: matrix and list formats.

3.1.1. Matrix-like file 

Each line in the text file contains a data matrix row. The characteristics 
of these files are: 

• The matrix must be symmetric, and the diagonal values must be zeros. 

• Within each row, the values can be separated by: spaces (' '), tab
character, semicolon (';'), comma (','), or vertical bar ('|').

• It is possible to include labels with the names of the nodes in an addi- 
tional first row or column, but not in both. 

• If present, the labels of the nodes cannot contain any of the previous 
separators. 
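
The requirements above can be checked before loading a file into the program. The following Python sketch is a hypothetical helper, not part of MultiDendrograms; for brevity it assumes a file without labels and treats any run of consecutive separators as a single one.

```python
import re

# Separators accepted in the text: space, tab, semicolon, comma, vertical bar.
SEPARATORS = re.compile(r"[ \t;,|]+")


def read_matrix(text):
    """Parse an unlabeled matrix-like file; check symmetry and zero diagonal."""
    rows = [
        [float(value) for value in SEPARATORS.split(line.strip())]
        for line in text.splitlines()
        if line.strip()
    ]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("matrix is not square")
    for i in range(n):
        if rows[i][i] != 0.0:
            raise ValueError(f"non-zero diagonal at row {i + 1}")
        for j in range(i + 1, n):
            if rows[i][j] != rows[j][i]:
                raise ValueError(f"asymmetric entry at ({i + 1}, {j + 1})")
    return rows
```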

3.1.2. List-like file

Each line in the text file contains three fields, which represent the labels 
of two nodes and the distance (or weight) between them. The characteristics 
of these files are: 

• The separators between the three fields can be: spaces (' '), tab char- 
acter, semicolon (';'), comma (','), or vertical bar ('|'). 

• The labels of the nodes cannot contain any of the previous separators. 

• Distances from a node to itself (e.g. "a a 0.0") must not be included. 

• MultiDendrograms accepts either the presence or absence of symmetric
data lines, i.e. if the distance between nodes a and b is 2.0, then it
is possible to include in the list just the line "a b 2.0", or both "a b
2.0" and "b a 2.0". If both are present, the program checks whether
they are equal.
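
A reader for the list format described above might look like the following Python sketch. It is an illustration under assumptions, not the parser used by MultiDendrograms: the function name `read_list` and the choice of raising `ValueError` on malformed input are hypothetical, and runs of consecutive separators are treated as one.

```python
import re

# Separators accepted in the text: space, tab, semicolon, comma, vertical bar.
SEPARATORS = re.compile(r"[ \t;,|]+")


def read_list(text):
    """Parse a list-like file into {frozenset({a, b}): distance}."""
    distances = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = SEPARATORS.split(line.strip())
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields")
        a, b, value = fields[0], fields[1], float(fields[2])
        if a == b:
            raise ValueError(f"line {number}: self-distance not allowed")
        pair = frozenset((a, b))
        # Symmetric lines ("a b" and "b a") are allowed if they agree.
        if pair in distances and distances[pair] != value:
            raise ValueError(f"line {number}: conflicting symmetric entries")
        distances[pair] = value
    return distances
```

Storing each pair as a `frozenset` makes "a b 2.0" and "b a 2.0" map to the same key, so the equality check for symmetric lines is a single dictionary lookup.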

3.2. Loading data 

Once we have our data in a compatible format, we can load them into
MultiDendrograms:

1. Choose the desired settings, mainly the options for the Type of
measure and the Clustering algorithm. These settings will be explained
in detail in subsection 3.4.

2. Click on the Load button.

3. Select the file to open and then click on the Open button.

4. The data are now loaded and their multidendrogram representation is
shown (see Fig. 1).

3.3. Actions 

MultiDendrograms has only two action buttons: Load and Update.
Load reads data from a file and creates a new window for the
corresponding multidendrogram, using the current values of the parameters,
while Update refreshes the active multidendrogram when one or more
parameters are changed. The name of the data file corresponding to the
active multidendrogram is shown below these buttons. It is possible to load
the same data file several times, in order to compare the appearance of the
multidendrogram for different parameter settings.

Figure 1: MultiDendrograms user interface.



3.4. Settings

The program automatically applies default values to the parameters
depending on the data loaded; these can then be adjusted as desired. Fig. 1
shows the settings panel in the left part of the window, with four
different areas corresponding to main data representation, tree, nodes and
axis settings, respectively.

Changes in the main data representation parameters affect the structure
of the multidendrogram tree, so it must be fully recomputed, an operation
which may take several seconds or even minutes (depending on the data size
and the computer speed). On the other hand, changes in the tree, nodes and
axis settings only modify the visual representation of the multidendrogram
and are much faster to update.

3.4.1. Main data representation settings

Type of measure: It allows choosing between two different types of mea- 
sure, Distances and Weights. Choose between them according to the 
meaning of the input data. With Distances, the closer the nodes the 
lower their distance. On the contrary, with Weights, the closer the 
nodes the larger their weight. By default, Distances is selected. 

Clustering algorithm: Seven distinct algorithms are available, Single Link- 
age, Complete Linkage, Unweighted Average, Weighted Aver- 
age, Unweighted Centroid, Weighted Centroid, and Ward. By 
default, Unweighted Average is selected. 

Precision: Number of significant digits of the data to be taken into account
for the calculations. This is a very important parameter, since distances
that are equal at a certain precision may become different when the
precision is increased. Thus, it may be responsible for the existence of
tied distances. As a rule, the precision should not be larger than the
resolution given by the experimental setup that has generated the data. By
default, the precision is set to that of the data value with the largest
number of significant digits.
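
The effect of the Precision parameter on ties can be illustrated with a short Python sketch. It assumes, for the sake of the example, that precision acts as the number of decimal places kept when comparing distances; the helper `tied_values` is hypothetical and is not how MultiDendrograms is implemented.

```python
from collections import Counter


def tied_values(distances, decimals):
    """Return the rounded values shared by two or more distances."""
    counts = Counter(round(d, decimals) for d in distances)
    return sorted(value for value, count in counts.items() if count > 1)
```

For the distances 0.4012, 0.4018 and 0.52, two decimals produce a tie at 0.4, whereas four decimals produce none, so the same data can yield a band or a plain binary merge depending on the chosen precision.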

3.4.2. Tree settings

Tree orientation: Four orientations are available, North, South, East, 
and West, which refer to the relative position of the root of the tree. 
By default, North is selected. 

Show bands: It allows choosing whether to show a band in case of tied
minimum distances between three or more elements, and selecting the color
of the band. If Show bands is selected, the bands show the heterogeneity
of distances between the clustered elements. Otherwise, the elements
are grouped at their minimum distance (see Fig. 2). By default, Show
bands is selected and its default color is light gray.

Let us explain the meaning of the bands. In MultiDendrograms, if several
pairs of elements share the same minimum distance, they are clustered
together in one step. For instance, suppose that the minimum distance is 0.4
and it corresponds to the tied pairs (A, B) and (B, C). MultiDendrograms
puts them together in the same cluster {A, B, C} at height 0.4. However, if
the distance (A, C) is 0.5, it is possible to represent the cluster {A, B, C}
as a rectangle which spans between heights 0.4 and 0.5, thus showing the
heterogeneity of the clustered elements.

Figure 2: Possibility of showing a heterogeneity band in case of ties.

3.4.3. Nodes settings

Nodes size: Six different node sizes are available. By default, nodes are
not shown.

Show labels: It allows showing or not the labels of the nodes and selecting 
their color and font. By default, Show labels is selected, the font is 
Arial and the color is black. 

Labels orientation: Three different orientations are available, Vertical, 
Horizontal and Oblique. By default, Vertical is selected. 

3.4.4. Axis settings

Show axis: It allows showing or not the axis and selecting its color. By 
default, Show axis is selected and the color is black. 

Minimum value / Maximum value: They allow choosing the minimum 
and maximum values of the axis, respectively. They also affect the 
view of the multidendrogram. The default values are calculated from 
the loaded data. 

Figure 3: Contextual menu (left) and ultrametric deviation measures (right).

Ticks separation: It allows choosing the separation between consecutive 
ticks of the axis. The default value is calculated from the loaded data. 

Show labels: It allows showing or not the labels of the axis, and selecting 
their font and color. By default, Show labels is selected, the font is 
Arial and the color is black. 

Labels every ... ticks: Number of ticks from one labeled tick to the next.
By default, it is set to 1.

Labels decimals: Number of decimal digits of the tick labels. By default,
it is set equal to the Precision parameter.

3.5. Analyzing and exporting results 

The contextual menu, available by right-clicking on any multidendrogram
window, gives access to several options for analyzing and exporting results
to file (see Fig. 3).

Show ultrametric deviation measures: Computes the ultrametric ma- 
trix corresponding to the active multidendrogram and obtains three 
deviation measures between the original matrix and the ultrametric 
one, which are the Cophenetic Correlation Coefficient, the
Normalized Mean Squared Error, and the Normalized Mean
Absolute Error (see Fig. 3).

Figure 4: Multidendrogram details window.

Show dendrogram details: Opens a window that contains all the
information of the multidendrogram in a navigable folder-like structure (see
Fig. 4). The available information in the details window is:

• Number of data items (leaves of the tree) under each interior node
of the multidendrogram. The interior nodes in the multidendrogram
representation correspond to the clusters found during the
agglomerative process.

• Minimum and maximum distances at which the children of an
interior node are joined to form a new cluster. These values can
differ only in case of tied minimum distances, in which case they
become a band in the multidendrogram representation.

• List of children for each interior node, which may be either interior
nodes or data items (leaves).

Save ultrametric matrix as TXT: Calculates the ultrametric matrix
corresponding to the loaded data and saves it to a text file represented as a
matrix, not as a list, with the labels of the nodes in the first row. This
text file can then be easily loaded into any text editor or spreadsheet
application.

Save dendrogram as TXT: Saves the multidendrogram details to a text 
file. 

Save dendrogram as Newick tree: Saves the multidendrogram details in 
Newick tree format. In this format, the information given by the bands 
is lost and only the minimum distance is saved. However, it has the 
advantage that it is a standard format used by many other applications, 
thus allowing their use to generate other graphical representations. 

Save dendrogram as JPG / PNG / EPS: It is also possible to save the 
image of the multidendrogram in three different formats (JPG, PNG 
and EPS) using the corresponding Save dendrogram as . . . context 
menu items. 
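
The three measures reported by Show ultrametric deviation measures can be approximated outside the program, for instance from a saved ultrametric matrix. The Python sketch below is an assumption-laden illustration: the cophenetic correlation is the usual Pearson correlation between the two sets of pairwise values, while the normalizations used here for the two error measures (dividing by the sum of squared, respectively absolute, original distances) are assumed rather than taken from the text, so the results may not match the program's output exactly. The function name `deviation_measures` is hypothetical.

```python
from math import sqrt


def deviation_measures(original, ultrametric):
    """Return (cophenetic correlation, normalized MSE, normalized MAE).

    Both arguments are equal-length sequences holding the upper-triangle
    entries of the original and the ultrametric matrices, in the same order.
    """
    n = len(original)
    mean_d = sum(original) / n
    mean_u = sum(ultrametric) / n
    pairs = list(zip(original, ultrametric))

    # Pearson correlation between original and ultrametric distances.
    cov = sum((d - mean_d) * (u - mean_u) for d, u in pairs)
    var_d = sum((d - mean_d) ** 2 for d in original)
    var_u = sum((u - mean_u) ** 2 for u in ultrametric)
    ccc = cov / sqrt(var_d * var_u)

    # Assumed normalizations: relative to the original distances.
    nmse = sum((d - u) ** 2 for d, u in pairs) / sum(d * d for d in original)
    nmae = sum(abs(d - u) for d, u in pairs) / sum(abs(d) for d in original)
    return ccc, nmse, nmae
```

The correlation is undefined when either set of values is constant, a case the sketch does not guard against.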

3.6. Command-line direct calculation 

It is possible to use MultiDendrograms in command-line mode to calculate 
the dendrogram without the graphical interface. This is useful in several 
situations: 

• To automate the generation of many dendrograms using scripts. 

• When there is no need of a plot of the dendrogram. 

• When the plot of the dendrogram is to be performed with a different 
program. 

• When the number of elements is too large to allow a graphical repre- 
sentation. 

• To be able to call MultiDendrograms from a different application.

The input parameters of a command-line call are:

• The name of the input data file, in matrix or list format.

• The type of measure: distances or weights.

• The clustering algorithm: single linkage, complete linkage, unweighted
average, weighted average, unweighted centroid, weighted centroid or
ward.

• The precision, i.e. the number of significant digits of the data for the 
calculations. This parameter is optional, and if not given it is calculated 
from the data. However, the rule should be not to use a precision 
larger than the resolution given by the experimental setup which has 
generated the data. 

The output results are: 

• A file with the dendrogram tree in text format. 

• A file with the dendrogram in Newick format. 

• A file with the ultrametric matrix. 

• The ultrametric deviation measures: the cophenetic correlation coef- 
ficient, the normalized mean squared error, and the normalized mean 
absolute error. 

The syntax of a command-line direct calculation is:

java -jar multidendrograms.jar -direct FILE TYPE METHOD [PREC]

A concrete example for a distances matrix using the complete linkage
method with 3 significant digits is:

java -jar multidendrograms.jar -direct data.txt DISTANCES Complete_Linkage 3
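
To call the command-line mode from a script, the argument list can be assembled programmatically. The Python sketch below follows the syntax shown above; the helper names `build_command` and `run_direct` are hypothetical, and the jar is assumed to be in the working directory. Only the argument spellings that appear in the example (DISTANCES, Complete_Linkage) are known from the text, so other values are passed through unchecked.

```python
import subprocess


def build_command(data_file, measure, method, precision=None,
                  jar="multidendrograms.jar"):
    """Build the argument list for a -direct call; precision is optional."""
    command = ["java", "-jar", jar, "-direct", data_file, measure, method]
    if precision is not None:
        command.append(str(precision))
    return command


def run_direct(data_file, measure, method, precision=None):
    """Run MultiDendrograms without the GUI; raises if the call fails."""
    return subprocess.run(
        build_command(data_file, measure, method, precision), check=True
    )
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.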



4. Case study 

We show here a case study performed with MultiDendrograms using data
from a real example which had been previously studied by Morgan and Ray
(1995), and Fernandez and Gomez (2008). It is the Glamorganshire soils
example, formed by similarity data between 23 different soils. A table with
the similarities between soils can be found in Morgan and Ray (1995), where
the values are given with an accuracy of three decimal places. In order
to work with dissimilarities as in Fernandez and Gomez (2008), we have
transformed the similarities $s(x_i, x_j)$ into the corresponding
dissimilarities $d(x_i, x_j) = 1 - s(x_i, x_j)$, even though
MultiDendrograms is also capable of directly working with similarities.

Figure 5: Complete Linkage dendrograms for the soils data. According to the brown earths
soil group formed by soils 1, 2, 6, 12 and 13, the dendrogram in (a) is worse than the one
in (b) because the former merges these soils at a later stage of the agglomerative
process.

The original data contain a tied value for pairs of soils (3, 15) and (3,
20), which is responsible for two different dendrograms using the Complete
Linkage method (see Fig. 5). Morgan and Ray (1995) explain that the 23 soils
had been categorized into eight "great soil groups" by a surveyor. Focusing
on soils 1, 2, 6, 12 and 13, which are the only members of the brown earths
soil group, we observe that the dendrogram in Fig. 5(a) does not place them
in the same cluster until they join soils from five other soil groups, forming
the cluster {1, 2, 3, 20, 12, 13, 15, 5, 6, 8, 14, 18}. From this point of view,
the dendrogram in Fig. 5(b) is better, since the corresponding cluster loses
soils 8, 14 and 18, each representing a different soil group. Therefore, in
this case we have two possible solution dendrograms and the probability of
obtaining the "good" one is, hence, 50%.

Figure 6: Complete Linkage multidendrogram for the soils data.

We now use the MultiDendrograms software to obtain the multidendrogram
representation corresponding to the Glamorganshire soils data. We
select Complete Linkage as the Clustering algorithm, and Precision
= 3. We obtain a multidendrogram that we have saved in JPG format using
the contextual menu. The result, shown in Fig. 6, reveals the existence of a
tie comprising soils 3, 15 and 20. Moreover, the multidendrogram gives us the
good classification, that is, the one with soils 8, 14 and 18 kept out of the
brown earths soil group. Except for the internal structure of the cluster
{1, 2, 3, 15, 20}, the rest of the multidendrogram hierarchy coincides with
that of the dendrogram shown in Fig. 5(b).

5. Conclusions 

MultiDendrograms is simple yet powerful software for computing hierarchical
clusterings of data, distributed under an Open Source license. It implements
the variable-group algorithm for agglomerative hierarchical clustering that
solves the non-uniqueness problem found in the standard pair-group
algorithm. This problem consists in obtaining different hierarchical
classifications from a single set of proximity data, when two or more
minimum distances between different clusters are equal during the
agglomerative process. In such cases, selecting a unique classification can
be misleading. Software packages that do not ignore this problem fail to
adopt a common standard with respect to ties, and many of them simply break
ties in an arbitrary way.

Starting from a distances (or weights) matrix, MultiDendrograms com- 
putes its dendrogram using the variable-group algorithm which groups more 
than two clusters at the same time when ties occur. Its main properties are: 

• When there are no ties, MultiDendrograms gives the same results as 
any pair-group algorithm. 

• It always gives a uniquely determined solution. 

• In the multidendrogram representation of the results, the occurrence of 
ties during the agglomerative process can be explicitly observed. Fur- 
thermore, the height of any fusion interval (the bands in the program) 
indicates the degree of heterogeneity inside the corresponding cluster. 

MultiDendrograms also allows the tuning of many graphical representa- 
tion parameters, and the results can be easily exported to file. A summary 
of its characteristics is: 

• Multiplatform: developed in Java, runs on all operating systems (e.g.
Windows, Linux and MacOS).

• Graphical user interface: data selection, hierarchical clustering op- 
tions, multidendrogram representation parameters, navigation across 
the multidendrogram, deviation measures, etc. 

• Also command-line direct calculation without graphical user interface. 

• Implementation of variable-group algorithms for agglomerative hierar- 
chical clustering. 

• Works with distance and weight matrices. 

• Many parameters for the customization of the dendrogram layout: size, 
orientation, labels, axis, etc. 

• Navigation through the dendrogram information in a folder-like
window.

• Calculation of the corresponding ultrametric matrix.

• Calculation of deviation measures: cophenetic correlation coefficient, 
normalized mean squared error, and normalized mean absolute error. 

• Save dendrogram details in text and Newick tree format. 

• Save dendrogram image as JPG, PNG and EPS. 

Although ties need not be present in the initial proximity data, they may
arise during the agglomerative process. For this reason, and given that the
results of the variable-group algorithm coincide with those of the pair-group
algorithm when there are no ties, we recommend using MultiDendrograms
directly. With a single action one learns whether ties exist and
also obtains the corresponding hierarchical classification.